XeHE: An Intel GPU Accelerated Fully Homomorphic Encryption Library
by Alexander Lyashevsky, Alexey Titov, Yiqin Qiu, and Yujia Zhai



Multi-tile Scaling

Intel packages multiple compute tiles on a single board for scalable performance (Blythe 2020). Due to the underlying complexities, implicit multi-tile submission cannot be counted on to deliver full performance on all platforms, as our own experience made quite evident. In general, applications do best when designed to spread work across multiple queues in ways that can easily be matched to the optimal usage pattern of a particular platform; this built-in flexibility leads to more portable, performance-portable code. In our case, knowing that memory-independent workloads would not automatically be distributed over all tiles of a multi-tile Intel GPU influenced us to adopt a more portable structure in our implementation. To maximize the utilization of multi-tile devices, XeHE maintains one queue per tile and submits workloads to the different queues. Listing 8 shows the implementation details of the library's multi-queue SYCL context: it checks whether multi-tile partitioning is supported on the current device via the SYCL sub-device partition functions, creates an in-order queue for each (sub-)device (tile), and attaches each queue to its corresponding (sub-)device.

The XeHE library achieves explicit multi-tile scaling by submitting workloads to these multiple queues, utilizing all the sub-devices initialized at SYCL context creation. Workloads on different queues are assumed to be memory independent, an assumption satisfied by submitting independent HE computation graphs to different queues; this reflects real-world applications, where different clients always send independent computation requests. The assumption also simplifies memory management across a multi-tile device and supports a separate memory cache for each queue, as mentioned in the section above. In addition, exploiting the advantage of fast tile-to-tile shared memory, we can load shared data, such as the security-parameter context, onto a single tile at initialization and share it across the tiles at run time. This reduces initialization overhead and simplifies the code structure without losing run-time performance. (A sketch of this dispatch pattern follows Listing 8.)

Listing 8 DPC++ Context with multiple queues

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

class Context {
    bool igpu = true;
    std::vector<sycl::queue> _queues;

    void generate_queue(bool select_gpu = true) {
        if (select_gpu) {
            sycl::device RootDevice = sycl::device(intel_gpu_selector());
            std::vector<sycl::device> SubDevices;
            try {
                // check whether sub-devices (tile split) are supported on the GPU device
                SubDevices = RootDevice.create_sub_devices<
                    sycl::info::partition_property::partition_by_affinity_domain>(
                    sycl::info::partition_affinity_domain::next_partitionable);
            } catch (...) {
                std::cout << "Sub devices are not supported\n";
                // only use the root device
                SubDevices.push_back(RootDevice);
            }
            // create in-order queues and attach them to the sub-devices
            sycl::context C(SubDevices);
            for (auto &D : SubDevices) {
                sycl::queue q(C, D, sycl::property::queue::in_order());
                _queues.push_back(q);
            }
        } else {
            // create queue based on CPU device
            ...
        }
    }

public:
    Context(bool select_gpu = true) {
        generate_queue(select_gpu);
    }
    ...
};
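
To make the multi-queue dispatch pattern concrete, here is a minimal, self-contained sketch, not XeHE's actual API, of how memory-independent requests might be spread round-robin over per-tile queues, and of how a shared read-only allocation (standing in for the security-parameter context) can be loaded once on one tile and read by kernels on every tile in the same SYCL context. The queue setup mirrors Listing 8 but uses the standard sycl::gpu_selector_v in place of the library's intel_gpu_selector(); the placeholder kernel and the params/results names are illustrative assumptions.

#include <sycl/sycl.hpp>
#include <cstdint>
#include <vector>

int main() {
    // One in-order queue per sub-device (tile), all in one SYCL context,
    // as in Listing 8.
    sycl::device root(sycl::gpu_selector_v);
    std::vector<sycl::device> tiles;
    try {
        tiles = root.create_sub_devices<
            sycl::info::partition_property::partition_by_affinity_domain>(
            sycl::info::partition_affinity_domain::next_partitionable);
    } catch (...) {
        tiles.push_back(root); // no tile split: fall back to the root device
    }
    sycl::context ctx(tiles);
    std::vector<sycl::queue> queues;
    for (auto &d : tiles)
        queues.emplace_back(ctx, d, sycl::property::queue::in_order());

    // Shared read-only data (standing in for the security-parameter context)
    // is loaded once; sub-devices of the same context can read each other's
    // device allocations over the fast tile-to-tile path.
    constexpr size_t param_len = 1024;
    uint64_t *params = sycl::malloc_device<uint64_t>(param_len, queues[0]);
    queues[0].fill(params, uint64_t(1), param_len).wait();

    // Memory-independent client requests go round-robin over the queues;
    // no cross-queue dependencies exist, so no inter-queue syncing is needed.
    constexpr size_t n = 4096;
    const int num_requests = 8;
    std::vector<uint64_t *> results;
    for (int r = 0; r < num_requests; ++r) {
        sycl::queue &q = queues[r % queues.size()];
        uint64_t *out = sycl::malloc_device<uint64_t>(n, q);
        results.push_back(out);
        // placeholder kernel standing in for an independent HE computation graph
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
            size_t idx = i; // id<1> converts to size_t
            out[idx] = params[idx % param_len] + static_cast<uint64_t>(r);
        });
    }
    for (auto &q : queues) q.wait(); // in-order queues: one wait per queue

    for (auto *out : results) sycl::free(out, ctx);
    sycl::free(params, ctx);
    return 0;
}

Because each request writes only to its own output buffer, no cross-queue synchronization or event chaining is required; the in-order property of each queue provides all the ordering the sketch relies on.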


